ABSTRACT

An important task in biomedical research is identify- 
ing biomarkers that correlate with patient clinical 
data, and these biomarkers then provide a critical 
foundation for the diagnosis and treatment of 
disease. Conventionally, such an analysis is based 
on individual genes, but the results are often noisy 
and difficult to interpret. Using a biological network 
as the searching platform, network-based bio- 
markers are expected to be more robust and 
provide deep insights into the molecular mechan- 
isms of disease. We have developed a novel 
bioinformatics web server for identifying network- 
based biomarkers that most correlate with patient 
survival data, SurvNet. The web server takes three 
input files: one biological network file, representing 
a gene regulatory or protein interaction network; 
one molecular profiling file, containing any type of 
gene- or protein-centred high-throughput biological 
data (e.g. microarray expression data or DNA 
methylation data); and one patient survival data file 
(e.g. patients' progression-free survival data). Given 
user-defined parameters, SurvNet will automatically 
search for subnetworks that most correlate with the 
observed patient survival data. As the output, 
SurvNet will generate a list of network biomarkers 
and display them through a user-friendly interface. 
SurvNet can be accessed at
http://bioinformatics.mdanderson.org/main/SurvNet.

INTRODUCTION 

With the advance of genome characterization technology,
high-throughput genomic and proteomic data of patients
have accumulated rapidly, allowing the systematic
identification of biomarkers (1-4). Biomarkers that correlate
with patient survival data are of particular interest 
because they provide a critical foundation for the diagno- 
sis and treatment of disease (5,6). Conventionally, such an 
analysis is based on individual genes. However, the results
thereby obtained are often noisy, which makes it difficult to
interpret the underlying mechanisms of disease. Biological networks
(e.g. gene regulatory networks or protein interaction 
networks) represent a reasonable way to summarize the 
functional behaviours of components within a biological 
system (7-9). Therefore, using a biological network as the 
searching platform, network-based biomarkers (i.e. a 
group of functionally related genes or proteins) are 
expected to be more robust and provide valuable 
insights into the molecular mechanisms of disease. 
Previous studies (10,11) on this topic have focused on 
other clinical data (such as metastasis status), and the 
utility of patient survival data has not been explored. 

In this study, we introduce SurvNet, a novel bioinfor- 
matics web server for identifying network-based 
biomarkers that most correlate with patient survival 
data. The web server takes three input files: one biological 
network file, one molecular profiling file and one patient 
survival data file. In order to identify network-based 
biomarkers, SurvNet uses established algorithms (10-12) 
for searching and evaluating the biomarkers. As the 
output, SurvNet generates a user-friendly display of 
network-based biomarkers. We expect SurvNet to be a 
valuable bioinformatics tool for the biomedical 
community. 

MATERIALS AND METHODS 

The computational approach used by SurvNet to iden- 
tify network-based biomarkers consists of three compo- 
nent processes: (i) a scoring function (combining the 
subnetwork property, molecular profile and patient
survival data), (ii) a searching algorithm (for finding the
candidate biomarkers) and (iii) an evaluation (validating
the statistical significance of the biomarkers).

Scoring function 

SurvNet first evaluates each gene (node) i by calculating
the P-value $p_i$ from a univariable Cox proportional
hazards regression model (13,14), which quantifies how
significantly the molecular profiling data of the gene
correlate with the patient survival data. Then, each gene
i is assigned a z-score $s_i$ transformed from $p_i$,

$$s_i = \Phi^{-1}(1 - p_i),$$

as the score for each node in the network, where $\Phi^{-1}$ is the
inverse standard normal cumulative distribution function
(12). For random data, $p_i$ follows a uniform distribution
from 0 to 1, and by the transformation, $s_i$ follows a
standard normal distribution, with smaller $p_i$ corresponding
to larger z-scores.
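The transformation above can be illustrated with a minimal Python sketch. It assumes that the univariable Cox P-values have already been computed elsewhere; the function name `p_to_z`, the clamping constant `eps` and the gene labels are hypothetical and are not part of SurvNet.

```python
from statistics import NormalDist


def p_to_z(p: float, eps: float = 1e-15) -> float:
    """Map a P-value to a z-score via s = Phi^{-1}(1 - p)."""
    # Clamp to the open interval (0, 1): the inverse CDF is undefined at 0 and 1.
    # The value of eps is an arbitrary choice made for this sketch.
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(1.0 - p)


# Hypothetical univariable Cox P-values for three genes (illustrative only).
p_values = {"GENE_A": 0.001, "GENE_B": 0.05, "GENE_C": 0.5}
z_scores = {gene: p_to_z(p) for gene, p in p_values.items()}
```

In this sketch a P-value of 0.5 maps to a z-score of zero, and smaller P-values map to progressively larger z-scores, as stated in the text.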

The scoring function $F_G$ of a subnetwork G with n genes
is calculated by an aggregate z-score (12),

$$F_G = \frac{1}{\sqrt{n}} \sum_{i \in G} s_i,$$

where $F_G$ follows a standard normal distribution if the $s_i$
are independently drawn from a standard normal distribution.
According to the formula, the null distribution of $F_G$ does not
depend on the subnetwork size n. Therefore, subnetworks with
different sizes are comparable under this scoring function.
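As a small worked example of the aggregate score, the sketch below assumes that the aggregate z-score takes the usual form from reference (12), namely the sum of the node z-scores divided by the square root of the number of genes. The function name `aggregate_score` and the input values are illustrative assumptions.

```python
import math


def aggregate_score(z_scores):
    """Aggregate z-score: sum of node z-scores divided by sqrt(n)."""
    z = list(z_scores)
    if not z:
        raise ValueError("a subnetwork needs at least one gene")
    return sum(z) / math.sqrt(len(z))


# Two hypothetical subnetworks whose genes all have z = 2.0.
small = aggregate_score([2.0, 2.0])            # n = 2
large = aggregate_score([2.0, 2.0, 2.0, 2.0])  # n = 4
```

Dividing by the square root of n keeps the variance of the score equal to one for independent standard normal inputs, which is why scores from subnetworks of different sizes can be compared. Consistent evidence across more genes still yields a higher score, as the two toy subnetworks show.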

Searching algorithm 

Because finding the connected subnetworks with the
maximal score is NP-hard (12), SurvNet uses a greedy
searching algorithm, as previously described (10-12).
The search starts from a seed gene i and expands
iteratively. The algorithm terminates and outputs the
candidate subnetwork when no candidate gene j around the
current subnetwork G satisfies the following two conditions:
(i) the number of edges in the shortest path
between j and the seed gene i is smaller than or equal to S
and (ii) the score of subnetwork G with gene j added is higher
than $(1 + p) \times F_G$, where S and p are two pre-determined
parameters. Specifically, S is used to reduce the search
space and p is a fixed rate of increase, ensuring that a new
gene added to the subnetwork must increase the network
score $F_G$ by a rate larger than or equal to p.
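The greedy expansion described above can be sketched in Python as follows. This is an illustration under stated assumptions, not SurvNet's code: the network is an adjacency dictionary, every node has a z-score, the score is the sum of z-scores divided by the square root of n, and at each step the best-scoring candidate is added (the text specifies only the stopping rule). The names `greedy_search`, `max_dist` (for S) and `rate` (for p), the value 0.1 for `rate` and the toy network are all hypothetical.

```python
import math
from collections import deque


def score(genes, z):
    """Aggregate z-score of a gene set (assumed form: sum / sqrt(n))."""
    return sum(z[g] for g in genes) / math.sqrt(len(genes))


def distances_from(seed, adj, max_dist):
    """Breadth-first search: shortest edge counts from the seed, up to max_dist."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the search distance
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def greedy_search(seed, adj, z, max_dist=2, rate=0.1):
    """Grow a subnetwork from the seed until no candidate meets both conditions."""
    reachable = distances_from(seed, adj, max_dist)  # condition (i)
    sub = {seed}
    current = score(sub, z)
    while True:
        # Candidates: neighbours of the current subnetwork within max_dist of the seed.
        frontier = {v for u in sub for v in adj.get(u, ())
                    if v in reachable and v not in sub}
        best = max(frontier, key=lambda v: score(sub | {v}, z), default=None)
        # Condition (ii): the new score must exceed (1 + rate) * current score.
        if best is None or score(sub | {best}, z) <= (1 + rate) * current:
            return sub, current
        sub.add(best)
        current = score(sub, z)


# Toy network and z-scores (hypothetical values).
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A"}, "D": {"B", "E"}, "E": {"D"}}
z = {"A": 2.0, "B": 2.5, "C": -1.0, "D": 3.0, "E": 4.0}
subnetwork, final_score = greedy_search("A", adj, z, max_dist=2, rate=0.1)
```

With a search distance of 2, node E is never considered even though it has the highest z-score, because it lies three edges away from the seed. This shows how the distance parameter restricts the search space.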

Evaluation 

SurvNet evaluates the statistical significance of the subnet- 
works identified in the searching step, as previously 
described (12). It first uses random sampling to see if the 
score of a subnetwork is significantly higher than that of a 
random gene set in the network. To do so, SurvNet 
randomly samples gene sets with n genes 10000 times. 
Then, the same scoring function is used to calculate the 
scores for the random gene sets. The population mean $\mu_n$
and standard deviation $\sigma_n$ are estimated from the sampled
gene sets. Finally, $F_G$ is calibrated against this background
distribution as follows:

$$\tilde{F}_G = \frac{F_G - \mu_n}{\sigma_n}.$$

This calibrated score is the final network score for a 
subnetwork in the output. Moreover, since the multivariable
Cox proportional hazards regression model is
widely used to quantify the correlation between a group
of genes and patient survival data, SurvNet also calculates
the multivariable Cox P-value for each subnetwork to
validate its clinical utility. One potential advantage of
SurvNet is that it can identify key disease genes that could have
been missed through single-gene-based analyses. For
example, TP53 is a master cancer gene in ovarian carcin- 
oma. Based on the protein expression and patient survival 
data from a recent study (1), TP53 protein, as a single 
node, shows no significant correlation with the patient 
survival, but a TP53-centered network is among the top 
biomarkers SurvNet identifies. 
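The random-sampling calibration described in this section can be sketched as follows. The sketch assumes the aggregate score is the sum of z-scores divided by the square root of n, and that the calibrated score is the raw score minus the background mean, divided by the background standard deviation. The function names, the fixed random seed and the toy z-scores are hypothetical. The multivariable Cox model is not reproduced here.

```python
import math
import random
import statistics


def aggregate_score(genes, z):
    """Aggregate z-score of a gene set (assumed form: sum / sqrt(n))."""
    return sum(z[g] for g in genes) / math.sqrt(len(genes))


def calibrated_score(sub_genes, z, n_samples=10000, seed=0):
    """Calibrate a subnetwork score against random gene sets of the same size."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    all_genes = sorted(z)
    n = len(sub_genes)
    # Background: scores of random gene sets with n genes drawn from the network.
    background = [aggregate_score(rng.sample(all_genes, n), z)
                  for _ in range(n_samples)]
    mu_n = statistics.fmean(background)
    sigma_n = statistics.stdev(background)
    if sigma_n == 0:
        raise ValueError("background scores have zero variance")
    return (aggregate_score(sub_genes, z) - mu_n) / sigma_n


# Toy data: 50 hypothetical genes with z-scores cycling through -3.0 .. 3.0.
z = {f"g{i}": float(i % 7) - 3.0 for i in range(50)}
top = [g for g in z if z[g] == 3.0][:5]
bottom = [g for g in z if z[g] == -3.0][:5]
top_calibrated = calibrated_score(top, z)
bottom_calibrated = calibrated_score(bottom, z)
```

A gene set made up of the highest-scoring genes ends up well above the random background, whereas one made up of the lowest-scoring genes ends up below it.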



WEB SERVER 

Input 

The web server accepts three input files. The first one is a 
biological network file, representing a gene regulatory or 
protein interaction network [a human protein-protein 
interaction network (15) is provided as the default]. This 
file contains all the edges of a biological network, in which 
each line represents an edge. The second file is a
molecular profiling file, containing any type of gene- or
protein-centred high-throughput biological data (e.g.
microarray-based gene expression data, reverse-phase
protein array (16) protein expression data, DNA methylation
data or gene mutation data). This file is a
tab-separated numeric matrix, where the column names
are the sample IDs and the row names are gene IDs. The
third file is a patient survival data file (e.g. patients'
overall survival time or progression-free survival time).
This file has three columns, named 'id', 'censor' and
'time', respectively.
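To make the three input formats concrete, the following Python sketch parses small in-memory examples of each file. Several details are assumptions rather than documented behaviour: the text does not state the delimiter of the network and survival files (tabs are assumed here), how the 'censor' column is coded, or how missing values are written. The function names and the toy contents are hypothetical.

```python
import csv
import io


def read_network(handle):
    """One edge per line: two node IDs (tab-separated by assumption)."""
    return [(row[0], row[1])
            for row in csv.reader(handle, delimiter="\t") if len(row) >= 2]


def read_profile(handle):
    """Tab-separated numeric matrix: sample IDs as columns, gene IDs as rows."""
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    n_samples = len(body[0]) - 1
    samples = header[-n_samples:]  # works with or without a corner label
    profile = {r[0]: [float(x) for x in r[1:]] for r in body}
    return samples, profile


def read_survival(handle):
    """Three named columns: 'id', 'censor' and 'time'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {r["id"]: (int(r["censor"]), float(r["time"])) for r in reader}


# Toy file contents (illustrative values only).
NETWORK_TXT = "TP53\tMDM2\nTP53\tATM\n"
PROFILE_TXT = "gene\tP1\tP2\tP3\nTP53\t0.5\t-1.2\t2.0\nMDM2\t1.1\t0.3\t-0.7\n"
SURVIVAL_TXT = "id\tcensor\ttime\nP1\t1\t12.0\nP2\t0\t30.5\nP3\t1\t7.25\n"

edges = read_network(io.StringIO(NETWORK_TXT))
samples, profile = read_profile(io.StringIO(PROFILE_TXT))
survival = read_survival(io.StringIO(SURVIVAL_TXT))

# Samples in the profiling matrix should also appear in the survival file.
missing = [s for s in samples if s not in survival]
```

Checking that the sample IDs of the profiling matrix match the 'id' column of the survival file before submission is a simple way to catch mismatched inputs.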

After uploading the required input files, users can set the 
search distance (Figure 1A). This parameter defines the
search area in the network: starting with each valid gene
(or protein) node as the seed, SurvNet will automatically
search for the optimal subnetwork(s) within this defined
distance. SurvNet uses the same network searching
algorithm that was previously described (10,11). A larger 
parameter will require a longer computation time. The 
default search distance is 2. 

Output 

In the final output, the subnetworks that SurvNet 
identifies will first be displayed in a table format 
(Figure 1B). These results can be directly downloaded.
The network files are in a '.dot' format that can be
visualized by GraphViz (http://www.graphviz.org).

Figure 1. Snapshots of the SurvNet web server. (A) Input page, through which input files and a search parameter can be specified. (B) Output page,
on which the top subnetwork biomarkers identified are displayed. (C) Visualization page, on which subnetworks can be visualized in a user-friendly
way.

As shown in Figure 1C, the identified subnetworks are
ranked within the table according to the network score,
from high to low score. Each score is associated with the
output items that follow. The network score, which 
quantifies how significantly the nodes in a subnetwork 
correlate with the observed patient survival data, is 
calculated based on the univariable Cox proportional 
hazards model and the network properties. A higher 
network score indicates a more significant correlation 
between the network and the patient survival time. The 
'Cox P-value' is the P-value derived from the 
multivariable Cox proportional hazards regression 
model. The 'gene_ID' indicates the seeded node for each 
subnetwork; the number of nodes indicates how many 
genes (or proteins) are in the subnetwork; and the 
number of edges indicates how many interactions are in 
the subnetwork. Users can further narrow down the
results by two output parameters: the network P-value
cut-off and the minimal number of nodes. The network
P-value cut-off determines how significant the returned
subnetworks must be compared to the random background;
the default significance level is 0.05. The minimal
number of nodes sets the smallest subnetwork size that is
returned; the default value is 2.
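The effect of the two output parameters can be illustrated with a short Python sketch. The record layout, the field names and the example values are hypothetical, and the use of "less than or equal to" for the P-value cut-off is an assumption; the text specifies only the defaults of 0.05 and 2.

```python
from dataclasses import dataclass


@dataclass
class Subnetwork:
    seed_gene: str    # seeded node of the subnetwork
    score: float      # calibrated network score
    network_p: float  # network P-value against the random background
    cox_p: float      # multivariable Cox P-value
    n_nodes: int
    n_edges: int


def filter_results(results, p_cutoff=0.05, min_nodes=2):
    """Keep subnetworks passing both filters, ranked from high to low score."""
    kept = [r for r in results
            if r.network_p <= p_cutoff and r.n_nodes >= min_nodes]
    return sorted(kept, key=lambda r: r.score, reverse=True)


# Hypothetical results (values invented for illustration).
results = [
    Subnetwork("GENE_A", 3.1, 0.010, 0.004, 4, 3),
    Subnetwork("GENE_B", 4.2, 0.001, 0.002, 6, 7),
    Subnetwork("GENE_C", 1.0, 0.200, 0.300, 3, 2),  # fails the P-value cut-off
    Subnetwork("GENE_D", 3.5, 0.005, 0.010, 1, 0),  # fails the minimal node count
]
ranked = filter_results(results)
```

Raising the minimal number of nodes removes single-gene results such as the last record, leaving only multi-gene subnetworks in the ranked table.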

After clicking the 'Graph' button in the final output 
page, users can visualize an identified subnetwork in a 
user-friendly Java applet that allows them to pan/zoom, 
search and retrieve useful information (from GeneCards)
for a node of interest. A detailed description of the
Java applet is available on the visualization page.



CONCLUSION 

We have developed SurvNet, a web server that can effi- 
ciently identify network-based biomarkers that most cor- 
relate with patient survival data. To the best of our 
knowledge, SurvNet is the only available bioinformatics 
tool for this function. SurvNet uses the network-based 
biomarker searching algorithms that were established in 
previous studies, and provides a user-friendly interface for 
exploring the identified biomarkers. We expect SurvNet to 
be a valuable resource for generating meaningful 
hypotheses for disease diagnosis and treatment. 